ABSTRACT

We present a simple and near optimal randomized parallel
scheduling algorithm for scheduling packets in routers based
on the Switch-Memory-Switch (SMS) architecture, which
emulates 'output queuing' by using a collection of small
memories within the switch to buffer packets, and which
forms the basis of the fastest routers in use today. For a
router with N inputs and N outputs, our algorithm computes
the schedule in O(log* N) rounds, where a round is a
communication of a few bits between input ports and memory
together with simple local computation at the inputs and
memory. Furthermore, by using an O(log* N) deep pipeline
at each input, our algorithm computes the schedule in a
constant number of rounds. Our pipelined algorithm is quite
simple and achieves optimal (i.e., constant) throughput with
a tiny O(log* N) delay.

We show that the total amount of buffer memory required
by our algorithm is close to the minimum required. We also
show that the number of buffer memories is within an
$\epsilon N$ additive term of 2N - 1, for any positive constant
$\epsilon > 0$ (and is within an additive term of o(N) for the
basic scheduler), where 2N - 1 is the minimum number of
memories needed under adversarial placement of packets.
Furthermore, we show that the number of extra memories that
we use over the minimum of N that is required in the offline
version is within a constant factor of the minimum required
by any on-line scheduler, even if that scheduler is allowed
to fail occasionally.

Our scheduling algorithm is randomized and works with 
high probability in N. We also prove that it has the 'self- 
stabilizing' property, i.e., it resumes its normal behavior if 
occasional lapses occur due to the probabilistic nature of the 
algorithm. 
1. INTRODUCTION

Routers play a critical role in modern computing of all 
forms including wide-area networks, multiprocessor servers,
and data storage systems [15, 24, 11, 4, 9, 12, 29] (see also 
[13, Chapters 7.12, 8.12]). Modern routers achieve high per- 
formance by solving computationally intensive tasks using 
custom hardware. One of the most challenging problems 
in designing a high-end router is scheduling the transfer of 
packets from inputs to outputs. 

A router used to be nothing more than a general pur- 
pose computer connected via a standard bus to hardware 
for transmitting and receiving packets over links. This was 
because the link bandwidth was low enough for a general 
purpose processor to implement the entire router functional- 
ity. With the advent of high-speed fiber optic technology [26, 
27], the situation has reversed, and in many networks today
routers are the bottleneck in moving data. 

Given that the cost of deploying and maintaining links far 
exceeds the cost of router hardware [15, Page 203] the trend 
has been to use quite extensive hardware in the router. Some 
of the tasks performed by routers can be accelerated us- 
ing brute-force solutions, e.g., by demultiplexing high-speed 
links and using replicated hardware. However the task of 
quickly transferring packets from inputs to outputs has not 
been solved satisfactorily so far, largely because of the com- 
plex co-ordination problem that is associated with it. 

Figure 1 (a) shows the block-level architecture of a router.
Packets are assumed to be of a fixed size. (IP network pack- 
ets can be variable sized; this is dealt with by segmenting 
them into fixed size packets at the input port, and reassem- 
bling them at the output port [24, Page 203].) Input line 
cards (or input ports) take packets from incoming links, and 
compute the output link to which the packet is to be for- 
warded. (It is assumed that the output link is determined 
by the final destination of the packet, and is not within the 
control of the scheduler.) The switch fabric transfers packets 
to the output ports, which transmit the packets on outgoing
links. Peterson and Davie [24, Chapter 3] and Keshav and 
Sharma [16] survey router architectures. 

Logically, the router operates in cycles: in each cycle, at
most one packet may arrive at an input port. The cycle time
is defined to be the amount of time between cycles; ideally
it is equal to the packet size divided by the link bandwidth,
unless the router requires a larger cycle time to perform
all the tasks that it needs to do on every packet, which
is currently the case.

We restrict our attention to routers that have N input 
ports and N output ports, with all links having the same 
bandwidth. At the beginning of every cycle, the router re- 
ceives at most one packet at each input and transmits at 
most one packet on each output. The arrival time of a packet
p is the cycle in which p arrived at the input of the router; 
the departure time of p is the cycle in which p is transmit- 
ted from the output. The difference between departure and 
arrival times of a packet is called its latency. 

Two (or more) packets destined for the same output port 
can arrive at different input ports in the same cycle. Conse- 
quently, one of the two packets will have to be buffered [24, 
15, 14]. This buffering can be performed at the input ports, 
within the switch fabric, and at the output ports. Because 
of contention for a shared output link, a link may become 
congested; when the number of packets waiting for the link 
exceeds the buffer capacity, packets will be dropped [15, 
Chapter 8.5]. 

At any given time, a router may have a large number of 
packets, enqueued in different queues, waiting to be trans- 
mitted through different outputs. In a single cycle only a 
subset of these queues can be advanced based on the con- 
straints imposed by the architecture of the router. Routers 
need to make scheduling decisions about which queues get 
advanced in each cycle. The average latency that packets 
observe at the router as well as the number of packets that 
get dropped by the router because of buffer overflow greatly 
depend upon the scheduling decisions made by the router. 
Thus it is essential to have an efficient scheduler. In a router 
with a large number of input and output ports, the schedul- 
ing algorithm often takes more time to compute the schedule 
than the router takes to transmit the packets. This paper 
introduces a fast scheduling algorithm; we are motivated 
by the fact that the schedule must be computed within the 
cycle time. 

A router is said to be output-queued if packets are buffered
solely at the outputs. Output queuing is strongly preferred 
for a number of reasons [23]. For example, it minimizes the 
average queuing delay faced by packets. It also guarantees 
that the relative ordering of packets is preserved. However, 
buffering packets solely at the output ports requires very 
high-speed memories and switch fabrics. Specifically, in an 
N input router, N packets for the same output can arrive in 
a cycle; consequently, the memory at the output port should 
be able to support N writes in a single cycle. 

In an input-queued router, packets are buffered solely at 
the inputs. The advantage of an input-queued architecture 
is that the buffer memory need only to be able to support 
one read and one write in a cycle. However, it is extremely 
difficult to schedule packets for departure across the switch 
fabric in such an architecture — naive approaches result in 
high drop rates [14], and more sophisticated approaches are 
too complex to run within the cycle time [20]. 



The switch-memory-switch (SMS) architecture buffers
packets in small memories placed between the input and output
ports. In this architecture, the output ports have buffers 
that need to hold just one packet, and the input ports have 
buffers of small size. Thus the main buffers in this architec- 
ture are the small memories placed between inputs and out- 
puts, which operate together. This is the architecture used 
by the fastest routers available today, the M160 and T640
Internet core routers from Juniper Networks [22]. (The power
of this architecture can be seen in the fact that within three 
years of its inception Juniper Networks took over from Cisco 
as the leading provider of routers for the Internet core.) 

There are three main advantages to using an SMS archi- 
tecture over other architectures: 

(1) The average delay can be minimized (as in output
queuing),

(2) The buffer memories need to support only one read and 
one write per cycle (as in input queuing), 

(3) With a good scheduling algorithm, the packets can be 
distributed almost equally among the buffer memories to 
make sure that a packet gets dropped only if all the buffers 
are full (thus the same packet drop rate can be achieved 
with smaller memories as compared to an output-queued or 
input-queued switch).

In this paper we present a near optimal scheduler for the 
SMS architecture. The scheduler is described in Section 4 
and its memory requirements, which are also close to opti- 
mal, are analyzed in Section 5. 

1.1 Prior Work on Router Scheduling 

Early routers used sequential algorithms; however, this 
is not an option with modern link speeds. Broadly speak- 
ing, recent parallel algorithms for scheduling have one or 
both of the following shortcomings: (1) they are ad hoc,
working well in some cases and very badly in others [19, 5,
6], or (2) they involve pointer-manipulating algorithms that
are unacceptably complicated even in the context of a large
budget for dedicated hardware [25]. McKeown et al. [19] de- 
scribe a heuristic parallel algorithm for scheduling in input- 
queued switches. However, its performance depends greatly 
on the incoming traffic, and there are natural traffic patterns 
for which it has an unacceptably high drop probability [6].
Prakash et al. [25] proposed an O(log^2 N) parallel algorithm
based on pointer jumping for scheduling packets in the SMS 
architecture; as in [7], this router emulates an output-queued 
router. However, the algorithm is impractical to implement, 
since it uses the NC algorithm in [17] to edge-color bipar- 
tite graphs. Chuang et al. [7] have shown that a router with 
buffering at both the input and output ports can emulate an 
output-queued router by performing 2 reads and 2 writes on 
the input and output buffers, respectively, and running the 
switch fabric twice in a cycle. Their approach hinges on a 
sophisticated scheduling algorithm which solves an instance 
of the stable marriage problem, which is again impractical 
to implement in hardware. 

2. THE SMS ARCHITECTURE 

Since we use the switch-memory-switch (SMS) architecture
presented in [25], we review the architecture and key
results in that paper. We defer a discussion of the details of 
the model of computation to Section 3. 

Figure 1 (b) depicts the SMS architecture. The set of input
ports is connected via an N x M interconnect to M memories;
these M memories are connected to the set of output
ports through another interconnect. Each of these memories
is of size K; we assume K > N (in practice, $K \gg N$). In
every cycle one packet can be read from and s packets can
be written to each memory. Not surprisingly, we will show
that as s increases the requirement on M goes down. Thus
if memory bandwidth is the bottleneck in the system then it
would be desirable to use s = 1, but otherwise one can boost
s as much as possible to reduce M. One can also consider
the case of using memory banks that support s reads and
s writes every cycle. But that would be equivalent to using
sM memory banks that support one read and one write
every cycle in our scheme. Since this case is already captured
in the analysis we do not consider it as a separate case.

Figure 1: (a) Architecture of a generic router. (b) The Switch-Memory-Switch (SMS) architecture.

2.1 Emulating output queuing 

Since output-queuing is highly desirable (cf. Section 1), 
our goal is to emulate the behavior of an N x N output- 
queued switch that has buffer memory space for L packets 
at each output using an SMS architecture. By emulation, we 
mean that for any arrival sequence (1) a packet is dropped 
by the SMS router iff it will be dropped by the output- 
queued router, and (2) if a packet is not dropped then the 
cycle in which it departs the SMS router must be the same as
the cycle in which it would have departed the output-queued 
router. 

The cycle in which a packet would have departed an output- 
queued router is referred to as its time-stamp. When a 
packet arrives at an input of an SMS router, its time-stamp 
is computed as described in section 2.4. In each cycle, pack- 
ets at the inputs are written to a subset of memories through 
the first interconnect, and packets whose time-stamp is equal 
to the current time are read from the memories and trans- 
ferred to the outputs through the second interconnect. 

2.2 Conflicts 

In the SMS architecture each memory can support one
read and s writes per cycle. Hence packets cannot be
arbitrarily placed in the memories. A packet faces two kinds of
conflicts. More than s packets that arrive at the same time
cannot be written to the same memory; this is referred to
as an arrival conflict. Since there are N input ports, the
maximum number of arrival conflicts a packet can have is
$\lceil (N-1)/s \rceil$. Departure conflicts occur if multiple packets
in the same memory need to depart simultaneously through
different outputs. Since there are N outputs, a packet can
have departure conflicts with at most N - 1 memories. Hence
if the number of memories $M \ge \lceil (N-1)/s \rceil + N$ there will
always be a conflict-free memory for each packet. A
conflict-free memory for an input is said to be compatible with that
input.
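
The counting argument above can be illustrated with a minimal sketch that tests a memory for both kinds of conflict and evaluates the bound on M. The function names and the data layout (one set of stored time-stamps per memory plus a per-cycle write counter) are assumptions made for illustration only; they are not the hardware interface of [25].

```python
def compatible_memories(timestamp, stored, writes_this_cycle, s):
    """Return the indices of memories with no arrival or departure conflict.

    stored[m]            -- set of time-stamps already held in memory m
    writes_this_cycle[m] -- packets already assigned to memory m this cycle
    s                    -- writes each memory supports per cycle
    """
    return [
        m
        for m, stamps in enumerate(stored)
        if timestamp not in stamps        # no departure conflict
        and writes_this_cycle[m] < s      # no arrival conflict
    ]


def sufficient_memories(n, s):
    """Bound from the text: ceil((N - 1) / s) + N memories always suffice."""
    return -(-(n - 1) // s) + n           # integer ceiling division


# Adversarial example with N = 4 ports and s = 1.
N, S = 4, 1
M = sufficient_memories(N, S)
# Three memories already hold a packet departing at time 9 ...
stored = [{9}, {9}, {9}, set(), set(), set(), set()]
# ... and the three other packets of this cycle were written to memories 3..5.
writes = [0, 0, 0, 1, 1, 1, 0]
free = compatible_memories(9, stored, writes, S)
```

In this worst case exactly one memory remains compatible, which is why one memory beyond the N - 1 departure conflicts and the arrival conflicts is enough.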

2.3 Scheduler tasks 

In order to construct a conflict-free schedule for transfer 
of packets the scheduler has three tasks to perform in every 
cycle. 

Task 1 Compute the time-stamp of all the newly arrived 
packets. 

Task 2 Match the newly arrived packets to memories such 
that there are no departure and arrival conflicts. 

Task 3 Read packets whose time-stamp is equal to the cur- 
rent time and transfer them to the output. 

Since the time-stamp of a packet is known when it is written 
to a memory, Task 3 is simple. We briefly describe how Tasks 
1 and 2 are performed. Task 2 is the most complex step and 
is the focus of this paper. 

2.4 Task 1: Time-stamp computation 

An array E[1..N] stores the earliest available time-slot
for each output.

Let $P^o_1$ through $P^o_{c^o}$ be the packets destined for output
port o that arrived in cycle T, ordered
according to the id of the input port at which they arrived. Then
the time-stamp of packet $P^o_i$ is set to $E[o] + i$ and $E[o]$ is set
to $\max(E[o] + c^o, T)$. This time-stamp assignment is
consistent with the requirement of emulating an output-queued
router, and can be efficiently computed by simple circuitry.

If the difference of time-stamp of a packet and current 
time is greater than L then it is dropped. This behavior 
is consistent with the behavior of an output-queued router 
with buffer of size L at each output. 
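
A minimal sketch of this bookkeeping follows, under stated assumptions: packets for an output are numbered from 1 in input-port order, ports are indexed from 0, and `assign_timestamps` and `is_dropped` are hypothetical names. The text does not say how a dropped packet affects E[o], so the sketch does not model that interaction.

```python
def assign_timestamps(E, arrivals, T):
    """Assign time-stamps to the packets that arrived in cycle T.

    E        -- list; E[o] is the per-output slot counter (updated in place)
    arrivals -- iterable of (input_port, output_port) pairs for this cycle
    Returns a dict mapping input_port -> time-stamp.
    """
    by_output = {}
    for inp, out in sorted(arrivals):              # order by input-port id
        by_output.setdefault(out, []).append(inp)

    stamps = {}
    for out, inputs in by_output.items():
        for i, inp in enumerate(inputs, start=1):  # i = 1..c (assumed 1-based)
            stamps[inp] = E[out] + i
        E[out] = max(E[out] + len(inputs), T)
    return stamps


def is_dropped(timestamp, T, L):
    """Drop rule from the text: time-stamp minus current time exceeds L."""
    return timestamp - T > L


# Cycle T = 10: inputs 0 and 2 send to output 0, input 1 sends to output 1.
E = [9, 12]
stamps = assign_timestamps(E, [(2, 0), (0, 0), (1, 1)], T=10)
```

With L = 2 the packet from input 1 would be dropped, since its time-stamp lies 3 cycles ahead of the current time.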

2.5 Task 2: Scheduling using graph matching 

For routers that are relatively small and slow, the SMS ar- 
chitecture can emulate output-queuing by using a straight- 
forward greedy sequential algorithm to compute an assign- 
ment of incoming packets to compatible memories. However 
for routers with many ports operating at high speeds, the 
sequential algorithm is not fast enough to compute the
assignment. The only known parallel algorithm for computing
the assignment is that of Prakash et al. [25]; however, it has
the disadvantages mentioned in Section 1.1.

3. COMPUTATIONAL MODEL 

In Section 4 we describe simple and fast algorithms for 
Task 2. In this section we describe the main features of the 
abstract model of the interface between the input ports and 
the memory banks in the SMS architecture. 

• There are N input ports, each with a buffer that can
hold J packets. At each input port, the current packet
is the packet at the head of that input buffer. In our
basic algorithm J is a constant; in the pipelined version
J = O(log* N). There are N output ports, which need
to buffer only one packet each.

• There are M > N memory banks, and each can hold
up to K packets. Our schedulers work for $M = (1 +
(1/s) + \epsilon)N$, where $\epsilon$ is either an arbitrarily small
constant or is o(1) as described later.

• There is simple hardware at the input ports as de- 
scribed in [25] (and summarized in Section 2.4 of this 
paper) that computes the departure time stamp for 
each current packet at the start of each cycle, based 
on the packet's output port. 

• Each input port and memory bank has O(log N) depth
circuitry of size O(N). Note that as routers become 
larger, distributing the hardware for computing the 
schedule across the input ports and the memory banks 
is preferable to having a separate centralized process- 
ing unit. 

• There is a dedicated wire connecting every (input port, 
memory bank) pair. This investment in hardware is 
not considered excessive if the wire needs to support 
transfer of only a few bits per cycle (see, e.g., [2, page 
6], [21]). With this hardware support, each input port 
can send a short message to each memory bank (and 
vice versa) in one communication step. At the receiv- 
ing end the identity of the transmitting node can be de- 
termined by examining the wire along which the mes- 
sage arrives. We will refer to such a communication 
step as a transmit step. 

Under current technology, the time taken by a transmit step 
dominates the cost of O(log N) time computation in hardware
at a single input port or memory bank. However, it is
considerably faster than the time taken to transfer a packet 
through the crossbar, since a packet is typically hundreds of 
bits long. 

4. THE SCHEDULING ALGORITHMS 

In Section 4.1 we describe a basic randomized scheduling
strategy for matching input packets to compatible memory
banks. We measure performance in terms of rounds, where
a round is a transmit step together with O(log N) time
computation at each input port and each memory bank. Our
basic scheduler runs in O(log* N) rounds.

In Section 4.2 we present a pipelined version of our basic
scheduler with a latency of O(log* N) rounds, but with the
improved performance of constant throughput. Thus in this
scheme the lag between successive transfers of packets
from input ports to memory banks is a constant number of
rounds. Since in many networks the limiting feature for
the cycle time is the router and not the link speed, this will
have the desirable effect of reducing the cycle time, thus
improving the bandwidth.

4.1 The Basic Matching Algorithm 

In this discussion each input is identified with the packet 
that just arrived at that input. Recall that an input i is com- 
patible with a memory m if the packet that just arrived at 
i can be stored in memory m without arrival and departure 
conflicts (see Section 2.2). 

At the beginning of a cycle, the time-stamp of each in- 
put port is broadcast to each memory and memories con- 
struct a list of inputs that are compatible with the memory. 
The algorithm then works in rounds according to the 'Basic 
Matching Process' given below. Anderson et al. [2] proposed 
a similar algorithm called Parallel Iterative Matching (PIM) 
for a completely different architecture, namely a crossbar- 
based input-queued router with "virtual output queues." In 
their case they need to compute a maximal matching in an 
arbitrary bipartite graph, and they prove that the expected 
number of rounds for their algorithm is O(log N).

Initially all the memory banks are unmatched. 

Basic Matching Process: 

1. In parallel each unmatched memory sends a message
to a random compatible input port.

2. In parallel each input port i picks a memory bank j
that sent it a message and assigns its current packet
to that memory bank. It then broadcasts a bit to
all memory banks to inform them that it is no longer
available to be matched (the bit sent to memory bank
j is a 1 and the bit sent to all other memory banks is a 0).

3. In parallel each memory bank that receives a 1-bit from
its matched input decrements a counter initially set to
s. If the counter goes down to zero, the memory bank
declares itself matched.
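
The three steps can be simulated sequentially with the following sketch. It is an illustration under assumptions rather than the hardware procedure: the text does not say how an input chooses among several senders, so a uniformly random choice is assumed here, and compatibility is modelled by a set of stored time-stamps per memory (a hypothetical data layout). The name `basic_matching` and its parameters are likewise illustrative.

```python
import random


def basic_matching(timestamps, stored, s, rng, max_rounds=200):
    """Simulate rounds of the basic matching process.

    timestamps[i] -- time-stamp of the current packet at input i
    stored[m]     -- set of time-stamps already held in memory m
    s             -- writes each memory accepts per cycle
    Returns (assignment, rounds), where assignment maps input -> memory.
    """
    stored = [set(x) for x in stored]      # work on a copy
    remaining = [s] * len(stored)          # step-3 counters
    assignment = {}
    rounds = 0
    while len(assignment) < len(timestamps) and rounds < max_rounds:
        rounds += 1
        # Step 1: each unmatched memory messages one random compatible input.
        requests = {}
        for m in range(len(stored)):
            if remaining[m] == 0:
                continue
            candidates = [i for i, ts in enumerate(timestamps)
                          if i not in assignment and ts not in stored[m]]
            if candidates:
                requests.setdefault(rng.choice(candidates), []).append(m)
        # Step 2: each input that received a message picks one sender
        # (assumption: uniformly at random).
        for i, senders in requests.items():
            m = rng.choice(senders)
            assignment[i] = m
            stored[m].add(timestamps[i])
            remaining[m] -= 1              # Step 3: decrement the counter
    return assignment, rounds


# N = 8 inputs, s = 2, and M = N + ceil(N/s) + 2 = 14 initially empty memories.
ts = [5, 5, 5, 6, 6, 7, 7, 7]
assignment, rounds = basic_matching(ts, [set() for _ in range(14)], 2,
                                    random.Random(0))
```

Because a memory sends at most one message per round, it gains at most one packet per round, so the compatibility test at the start of a round is enough to rule out both kinds of conflict.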

4.1.1 Analysis of the Basic Matching Algorithm

In this section we establish that if $M = N + \lceil N/s \rceil + \epsilon N$,
for any e > l/2 log "^, then w.h.p. in A/, the number of 
rounds needed to match every input to a compatible memory
bank is O(log* N). The analysis views the computation
in the 'balls-in-bins' framework, and the slight excess
in the number of available memory banks over the bound of
$N + \lceil (N-1)/s \rceil$ given in Section 2.2 allows for the
acceleration in the matching process in successive rounds, leading
to the O(log* N) bound. Randomized strategies with
O(log* N) complexity are known in the literature for other
scenarios, e.g., in the context of highly-parallel algorithms 
for the CRCW PRAM [18] and in emulating shared-memory 
on distributed memory (see, e.g., [8]), and our strategy is 
similar to these in terms of accelerating progress in suc- 
cessive rounds. However, our framework and analysis are 
different. Our main theorem is proved through a sequence 
of lemmas. 

LEMMA 4.1. If there are k unmatched inputs at the
beginning of a round then there must be at least $\epsilon N + \lceil k/s \rceil$ unmatched
compatible memory banks for each input.
Proof. A memory bank could be unavailable for a given
input for one of two reasons: either there is
already a packet in that memory (stored in previous
cycles or matched to that memory in the current cycle) that
has the same time-stamp, or s other inputs have
already been matched to that memory. There could be at
most N - 1 packets with the same time-stamp, which could
eliminate N - 1 memories as potential matches. Since N - k
inputs have been matched to memories, there could be at
most $\lfloor (N-k)/s \rfloor$ memories that have been matched to s
inputs. This could further eliminate at most $\lfloor (N-k)/s \rfloor$
memories as potential matches. Thus we will have at least
$M - (N-1) - \lfloor (N-k)/s \rfloor \ge \epsilon N + \lceil k/s \rceil$ memories that are
compatible with a given input. □
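
The last inequality of the proof can be spelled out as follows. This is a worked step added for clarity, assuming the value of M stated at the start of this section, namely M = N + ⌈N/s⌉ + εN.

```latex
\begin{align*}
M - (N-1) - \left\lfloor \frac{N-k}{s} \right\rfloor
  &= \epsilon N + 1 + \left\lceil \frac{N}{s} \right\rceil
     - \left\lfloor \frac{N-k}{s} \right\rfloor \\
  &\ge \epsilon N + 1 + \left\lceil \frac{k}{s} \right\rceil
  \;>\; \epsilon N + \left\lceil \frac{k}{s} \right\rceil .
\end{align*}
```

The middle step holds because $\lceil N/s \rceil - \lfloor (N-k)/s \rfloor$ is an integer that is at least $N/s - (N-k)/s = k/s$, and is therefore at least $\lceil k/s \rceil$.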

Define a round that starts with k unmatched inputs to be
successful if it ends with at most $k e^{-(1/s + \epsilon N/k)} + \sqrt{2M \log M}$
unmatched inputs. In the following lemma we prove that
w.h.p. a round is successful.

LEMMA 4.2. If there are k unmatched inputs and M memories
at the beginning of a round and each input can be
matched to at least $\epsilon N + \lceil k/s \rceil$ memories, then the expected
number of unmatched inputs at the end of that round is at
most $k e^{-(1/s + \epsilon N/k)}$. Furthermore, the probability that the
number of unmatched inputs exceeds its mean by more than
$\sqrt{2M \log M}$ is at most $1/M$.

Proof. First we bound the expectation. Let $\nu(m)$ be the
set of unmatched inputs that can be matched to memory m
and let $\eta(i)$ be the set of unmatched memories that can
be matched to input i. Clearly $|\nu(m)| \le k$ and $|\eta(i)| \ge
\epsilon N + k/s$.

Let $C_m$ be the index of the input to which memory m sends
a request. Thus $\Pr[C_m = j] = 1/|\nu(m)|$ if $j \in \nu(m)$ and 0
otherwise. Let $C = (C_1, C_2, \ldots, C_M)$ and define the random
variable $X_i(C)$ to be 1 if $\forall j\,(C_j \ne i)$ and 0 otherwise.
Informally, $X_i(C)$ indicates that input i did not get a request
from any of the memories. Since an input is matched if and
only if it gets a request from at least one of the memories,
$X_i(C) = 1$ implies input i did not get a match in that round.
Let $X(C) = \sum_i X_i(C)$ be the total number of unmatched
inputs at the end of the round. Then,

$$E(X(C)) \le k(1 - 1/k)^{\epsilon N + \lceil k/s \rceil} \le k e^{-(1/s + \epsilon N/k)}.$$

We now use Azuma's inequality [1] to bound the
probability of deviation. Let us define a sequence of random
variables $Y_0$ through $Y_M$ as follows:

$$Y_m(C) = E(X(C) \mid C_1, C_2, \ldots, C_m).$$

In particular, $Y_0(C)$ is equal to the constant $E(X(C))$ and
$Y_M(C)$ is identical to $X(C)$. Since $E(Y_m \mid Y_{m-1}) = Y_{m-1}$, the
sequence of random variables $Y_m$ is a martingale. Furthermore,
if C and C' differ in the choice of only one memory then
that memory could choose at most one input that was not
chosen by any other memory. Thus the difference in the number
of unmatched inputs can be at most one. Hence by Azuma's
inequality we have $\Pr[X(C) > E(X(C)) + \sqrt{2M \log M}] \le 1/M$. □

Since 1/M < 1/N, the first O(log* N) rounds are
successful w.h.p. in N. The following discussion assumes that they
are successful. 
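
To get a feel for how quickly successful rounds shrink the set of unmatched inputs, the following sketch iterates only the expectation bound of Lemma 4.2, k -> k e^{-(1/s + εN/k)}. It deliberately ignores the additive deviation term sqrt(2M log M), so it illustrates the acceleration rather than the full high-probability argument; the function name and the example parameters are assumptions.

```python
import math


def expected_unmatched_trace(n, s, eps, stop_below=1.0):
    """Iterate k <- k * exp(-(1/s + eps*n/k)) starting from k = n.

    Returns the list of successive values, ending with the first value
    that falls below stop_below.
    """
    k = float(n)
    trace = [k]
    while k >= stop_below:
        k = k * math.exp(-(1.0 / s + eps * n / k))
        trace.append(k)
    return trace


# Example: one million ports, one write per memory per cycle, eps = 0.1.
trace = expected_unmatched_trace(10**6, 1, 0.1)
```

Each step multiplies k by a factor that itself shrinks as k shrinks, which is the source of the doubly exponential progress exploited in the rest of the analysis.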



Let $k_r$ be the number of unmatched inputs at the
beginning of round r. We know that $k_0 = N$ and $k_r$ decreases
in successive rounds. Let R be the last round for which
$k_R > W\sqrt{2M \log M}$, where W is a constant chosen to ensure
that $k_{r+1} \le (k_r/a)\, e^{-\epsilon N/k_r}$ for $r \le R$, where $1 < a < e^{1/s}$. We
will prove that R = O(log* N). (Note that by Lemma 4.1
and a Chernoff bound, w.h.p. in N all inputs are matched
in round R + 1.)

For a > 1 and integer $i \ge 0$ we define $a \uparrow\uparrow i = \varphi(a, i)$,
where $\varphi(a, 0) = 1$ and $\varphi(a, i) = a^{\varphi(a, i-1)}$ for i > 0.
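
The iterated exponential just defined and the log* function that appears in the round bounds are inverses of one another in the sense shown by this small sketch. The names `tower` and `log_star` are illustrative, and base 2 is assumed for log*.

```python
import math


def tower(a, i):
    """Iterated exponential: tower(a, 0) = 1, tower(a, i) = a ** tower(a, i - 1)."""
    value = 1
    for _ in range(i):
        value = a ** value
    return value


def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)
        count += 1
    return count


towers = [tower(2, i) for i in range(5)]   # [1, 2, 4, 16, 65536]
```

Since tower(2, 5) = 2^65536 already exceeds any realistic port count, log* N is at most 5 for every router size of practical interest.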

LEMMA 4.3. For every constant c > 0 there exists a constant
$b = e^{\epsilon/c}$ such that if there are k unmatched packets at
the beginning of a round $r \le R$ and for some positive integer
i we have $k \le cN/(b \uparrow\uparrow i)$ then the number of unmatched inputs at
the end of that round is at most $cN/(b \uparrow\uparrow (i+1))$, w.h.p. in N.

Proof. The number of unmatched inputs at the end of 
round is at most < = OTTO' D 

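From Lemma 4.3 it trivially follows that $k_{r+1} \le k_r/a$.
Let $A = \lceil \log_a (\ln 2/\epsilon) \rceil$. Hence after A initial rounds we have
$k_A \le \epsilon N/\ln 2$. Now substituting $c = \epsilon/\ln 2$ in Lemma 4.3
we have b = 2, and hence $k_r \le \frac{\epsilon N}{\ln 2\,(2 \uparrow\uparrow i)}$ implies

$$k_{r+1} \le \frac{\epsilon N}{\ln 2 \cdot a\,(2 \uparrow\uparrow (i+1))} \le \frac{\epsilon N}{\ln 2\,(2 \uparrow\uparrow (i+1))}.$$

Since $k_A \le \epsilon N/\ln 2$, applying the above inequality
repeatedly we obtain $k_{r+A} \le \frac{\epsilon N}{\ln 2\,(2 \uparrow\uparrow r)}$. Thus at the end of $A + \log^* N$
rounds we cannot have more than $W\sqrt{2M \log M}$ unmatched
inputs. Since $W\sqrt{2M \log M}$ inputs can be matched in a
single round w.h.p. in N, we can match all the inputs in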
,4 + log* N-r I = 0(log* N) rounds, if e - fi(l/2< l ^ log * N ). 
This gives us the following theorem. 

THEOREM 4.4. If the router can transfer s packets to each 
memory in a cycle, then if M = N+ \N/s \ + 2 (i/af[o 8 « n )> 
repeated applications of the basic matching process will match 
all inputs to memories in O(log* N) rounds with high
probability in N.

4.2 Pipelined Randomized Scheduler 

The scheduling algorithm described in the previous
section uses O(log* N) rounds of the basic matching procedure.
Thus the cycle time must be sufficiently long to be able to
complete these O(log* N) rounds, and as N increases the
cycle time must increase, resulting in a drop of throughput.
In this section we address this drawback by presenting a
pipelined scheduler that executes each cycle in a constant
number of rounds.

The pipelined scheduler uses multiple cycles to construct 
a matching for each set of packets that arrive together. How- 
ever matchings are constructed for multiple sets of packets 
simultaneously in a pipelined fashion. Consequently, the 
amount of computation per cycle reduces but packets wait 
for D cycles at the inputs before they are transferred to 
the memories. The value D is the latency of the pipelined 
scheduler (we will show later that D = O(log* N)). The
input buffer size J equals D, and packets are stored FIFO. 

Let $P^o_1$ through $P^o_{c^o}$ be the packets destined for output
port o that arrived during cycle T, ordered
according to the id of the input port at which they arrived. We
maintain an array earliest[1..N] to keep track of the earliest
time-stamp available for each output, after taking latency
into account. The time-stamp of packet $P^o_i$ is then set to
earliest[o] + i + D and earliest[o] is updated to max(earliest[o] +

In cycle T the packets that arrived between cycles T - D 
and T are in the input buffers and at the end of cycle T 
the packets that arrived at cycle T — D that are matched 
are transferred to the memories. Each input port will have 
an initial sequence of packets in its buffer that have been 
matched to some memory by the scheduling algorithm in 
earlier iterations, and the remaining packets are not yet 
matched by the scheduling algorithm. At any point in the 
scheduling algorithm, the first unmatched packet in each 
buffer is the active packet for the step, and the basic match- 
ing process will be applied to the set of active packets. 

Let the current cycle be T. A stage of the pipeline
executes the three steps in the following pipelined matching
procedure $\omega$ times, where $\omega$ is an integer constant to be
defined later in the analysis.

Pipelined Matching Procedure 

(a) The input ports perform a transmit step in which each 
input port broadcasts to all the memories the time- 
stamp of its active packet (as in the first scheduling 
algorithm) together with its arrival time mod D. 

(b) The basic matching procedure is executed in parallel 
for each distinct arrival time to match some of the 
active packets to memory banks. Note that since each 
input can have at most one active packet, at most one 
message goes between any memory-input pair. 

(c) Each matched active packet is replaced by the first 
unmatched packet in its buffer. 

Finally, all matched packets that arrived in cycle T — D are 
transferred to the memory banks, and this concludes the 
stage. Any unmatched packet that arrived in cycle T — D is 
dropped. 

We show below that w.h.p. every packet that arrived in 
cycle T - D will be matched at the end of this stage. Note 
that the pipelined scheduling algorithm performs a constant 
number of rounds per stage. 

4.2.1 Analysis 

Our analysis assumes that $M = (1 + \lceil 1/s \rceil + \epsilon)N$, where
$\epsilon$ is an arbitrarily small positive constant. The complete
analysis is in the Appendix. Here we present a simplified
analysis for the case when e and s are both 1. Let z r — -^j^ . 
With $s = 1$ and $\epsilon = 1$, the number of unmatched inputs
goes down by a factor of 2 w.h.p. in each iteration of the
basic matching algorithm, and after r iterations of the basic
matching algorithm in the non-pipelined setting, the number
of unmatched packets is $< z_r$ if $z_r > \sqrt{N}$. Let $D = \log^* N$.
Let $\omega = 2$, i.e., a stage of the pipelined scheduler consists of
two iterations of the pipelined matching procedure.

Let $Q_i(T)$ be the set of input ports that have i unmatched
packets at the start of cycle T, and let $q_i(T) = |Q_i(T)|$. Let
$s_i(T) = \sum_{k=i}^{D} q_k(T)$. We define a predicate $A_0(T)$ to be true
iff for all $i \le D$, $s_i(T) \le z_i$.

Theorem 4.5. If $A_0(T-1)$ is true then, w.h.p. in N,
$A_0(T)$ is true.

Proof. Consider the start of cycle T. Note that for any
input port with i unmatched packets, the number of packets
that can be matched at that port during cycle T - 1 is 0, 1,
or 2 (since we have assumed that $\omega = 2$). Let $r_i(T-1)$ be
the number of inputs that had i or i - 1 unmatched packets
at the start of cycle T - 1 and have at least i - 1 unmatched
packets at the end of cycle T - 1. Since one new packet
arrives at each input port at the start of cycle T, we have

$$s_i(T) = \sum_{k=i}^{D} q_k(T) \le \sum_{k=i+1}^{D} q_k(T-1) + r_i(T-1) \le s_{i+1}(T-1) + 3z_{i+1}.$$

The last inequality above uses the bound $r_i(T-1) \le 3z_{i+1}$.
We can establish this as follows:

Let $n_1$ be the number of inputs in $Q_i(T-1)$ that are
unmatched after the first iteration of stage T - 1, let X be
the set of inputs that have i - 1 unmatched packets after the
first iteration of stage T - 1, and let $n_2$ be the number of
inputs in X that are unmatched after the second iteration
of stage T - 1. Then $r_i(T-1) = n_1 + n_2$.

Since $q_i(T-1) \le s_i(T-1) \le z_i$ (by the induction
assumption), we have $n_1 \le z_{i+1}$ (since the number that did
not receive a match in an iteration of stage T - 1 is the same
as that derived for the basic scheduling algorithm, because the
pipelined algorithm executes the basic algorithm in parallel
for each i).

For $n_2$ we note that $|X| = x_1 + x_2$, where $x_1$ is the number
of inputs that had i unmatched packets at the start of cycle
T - 1 and have i - 1 unmatched packets after the first
iteration, and $x_2$ is the number of inputs that had i - 1
unmatched packets at the start of cycle T - 1 and continue
to have i - 1 unmatched packets after the first iteration.
Clearly, $x_1 \le q_i(T-1)$, and $x_2 \le z_i$ by the behavior of the
basic matching process on inputs that had i - 1 unmatched
packets at the start of cycle T - 1. Hence, $|X| \le q_i(T-1) + z_i
\le z_i + z_i = 2z_i$. Hence $n_2 \le 2z_{i+1}$. Hence $r_i(T-1) \le 3z_{i+1}$.

So we have

$$s_i(T) \le s_{i+1}(T-1) + 3z_{i+1} \le 4z_{i+1} \le z_i.$$

□ 

COROLLARY 4.6. W.h.p. in N , all packets that arrived in 
cycle T — D have been matched by end of cycle T. 

Proof. From the theorem, $q_D(T-1) = s_D(T-1) \le
\max(z_D, \sqrt{N}) = \sqrt{N}$. During the first iteration of cycle T,
the basic matching procedure is applied to these at most $\sqrt{N}$ inputs.
Hence w.h.p. in N all packets that arrived in cycle T - D
are matched after this step, and certainly by the end of cycle
T. □

Since $A_0(0)$ is trivially true, by Theorem 4.5 we can argue
inductively that $A_0(T)$ is true when $T = O(N)$. However as
T grows large, the probability that $A_0(T)$ will continue to
be true becomes small and then we can no longer guarantee
that all the packets that arrived in cycle T - D will be
matched at the end of cycle T. However our algorithm has
a "self-stabilizing" property, i.e., if $A_0(T)$ becomes false for
some T, within $O(\log N)$ cycles the input queues get back
to a state where the predicate $A_0$ is true.

Define a series of predicates $A_j(T)$ such that $A_j(T)$ is true
iff for all i, $s_i(T) \le \phi^j z_i$ for some constant $\phi > 1$. Note
that $A_j(T)$ implies $A_k(T)$ if $k > j$.

Theorem 4.7. If j > 0 and $A_j(T-1)$ is true then, w.h.p.,
$A_{j-1}(T)$ is true.

Proof. (Sketch.) Recall that in the proof of Theorem 4.5
we proved that Si(T) < 3z*. Using a similar argument here 
we can prove if that Aj(T - 1) is true then s*(T) < 3(^2*. 
Now for c > 30 we get Si(T) < c4P~ x Zi. If j > 0 then 
so <po trivially. Hence Aj_i(T). □ 

Now since $A_{\log_\phi N}(T)$ is always true, in $\log_\phi N$ steps we
get back to a state where $A_0(T)$ is true. This establishes the
self-stabilizing feature of our pipelined algorithm.

The pipelined scheme that we have presented in this section
requires the memory banks to send multiple requests in
a single round (although only one message is sent along any
wire). Further, each cycle consists of $\omega$ rounds and $\omega$ can be
a potentially large constant depending on s. We have developed
a simpler pipelined algorithm that uses $O(\log \log N)$
stages [3]. Although it uses a larger number of pipeline
stages, the advantage of this scheme is that each cycle consists
of only 2 rounds and each memory bank needs to send
only one request message in a round.

5. MEMORY REQUIREMENTS 

The memories used to buffer packets contribute signifi- 
cantly to the total cost of a router. Thus it is important to 
minimize both the number of memories used, and the size 
of each memory. 

Routers need a large amount of memory in order to achieve 
low drop rates. Studies of Eckberg et al. [10] reveal that 
packet drop probability significantly decreases if memories 
can be shared across the queues for different outputs. Eck- 
berg et al. show that for a Poisson packet arrival process, 
the amount of buffer required to achieve a certain drop prob- 
ability when the arrival rate of packets is more than 90% of 
the total capacity of the router, reduces by a factor of 4 if a 
shared memory is used. 

In this section we establish that our schedulers make very 
effective use of memory. In section 5.1 we show that the 
total memory used by our schedulers is very close to the 
minimum needed. In section 5.2 we show that the number 
of memory banks used by our schedulers is also close to the 
best possible.

5.1 Load Balance

One of the features of our algorithms is that they dis- 
tribute packets evenly across the memory banks. This en- 
ables us to achieve the effect of a pure shared memory. This 
is independent of any assumptions on the packet arrival pro- 
cess, as shown in the theorem below. 

THEOREM 5.1. Consider an SMS switch with N input
and output ports, M memory banks, each of size K, and
with each shared buffer supporting s writes per cycle. Let Q
be given as an upper bound on the total number of packets
in the memories in any cycle. If $K \ge Q/M + \sqrt{2csZ \log M}$,
with $c > 1$, then w.h.p. in M both of our SMS schedulers
can buffer packets for up to Z cycles without dropping any
packets.

Proof. The result follows through the use of Azuma's 
inequality on the martingale that considers the number of 
packets in any given memory bank in each cycle. 

Consider an arrival sequence of packets that leads to buffer- 
ing of a total of R packets at the end of T - 1 cycles. 



Let $U_i$ be the map that maps packets that arrived at cycle
$T - 1 - Z + i$ to the memories in which they were stored.
Let $U = (U_1, U_2, \ldots, U_Z)$ and let $V_m$ be the random variable
denoting the number of packets stored in memory m at the end
of the (T - 1)-th cycle. Note that since all the packets in memory
arrived within the last Z cycles, U has sufficient information
to compute $V_m$. Since there is no special bias for any of the
memories, $E(V_m \mid R) = R/M$. Define a sequence of random
variables

$$W_i = E(V_m \mid U_1, U_2, \ldots, U_i), \quad 0 \le i \le Z,$$

where $W_0 = E(V_m)$ and $W_Z = V_m$. Since $E(W_i \mid W_{i-1}) =
W_{i-1}$, the sequence of random variables $W_i$ is a martingale.
Furthermore, if U and U' differ in only one of the
$U_i$, for the $(T - Z - 1 + i)$-th cycle, at most s packets could be
stored in memory m in that cycle, and at most one can
leave. Therefore $V_m$ satisfies the s-Lipschitz condition, i.e.,
$|V_m(U) - V_m(U')| \le s$. Thus using Azuma's inequality we
obtain

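$$\Pr\left[\,|W_Z - W_0| > \sqrt{2csZ \log M}\,\right] < e^{-c \log M} = \frac{1}{M^c}.$$

Thus,

$$\Pr\left[\bigcup_{1 \le m \le M} \left\{ V_m > R/M + \sqrt{2csZ \log M} \right\}\right] < \frac{1}{M^{c-1}}.$$

Since $R \le Q$ and $c > 1$, we have the desired result w.h.p.
in M. □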

Corollary 5.2. Consider an SMS switch that emulates
an output-queued switch with N ports and output buffer size
L with M memory banks, each of size K, and each supporting
s shared writes per cycle. If $K \ge LN/M + \sqrt{2csL \log M}$,
where $c > 1$ is a constant, then with high probability, both of
our schedulers will not drop a packet that will not be dropped
by that output-queued switch.

Proof. Use Z = L and Q = LN in the theorem. □

Note that in general, an output-queued switch will use
a conservative value for L to allow for occasional bursts of
traffic for a single output. Thus the value of Q in the above
theorem is typically much smaller than LN, and hence our
scheduler would typically make much better use of the memory
than a corresponding output-queued switch. Also, note
that since typically $Q/M \gg M \gg \log M$, the value of
K can be chosen to be only very slightly larger than Q/M,
the minimum size needed, and the packet drop probability
could be held very small even if Z is made very large. Note
also that the value of Z in the above theorem is limited in
only a weak way by the upper bound placed on the value of
Q, even if the value of K is to be held close to Q/M.
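
To get a feel for how small the additive term in Theorem 5.1 is, the bound can be evaluated numerically. The following is a minimal sketch: `min_bank_size` is a hypothetical helper name, and the logarithm is assumed to be the natural logarithm, which the text does not state explicitly.

```python
import math


def min_bank_size(Q: float, M: int, s: int, Z: int, c: float = 1.5) -> float:
    """Evaluate Q/M + sqrt(2*c*s*Z*log M), the bank size in Theorem 5.1.

    Assumes log is the natural logarithm and requires c > 1.
    """
    if c <= 1:
        raise ValueError("Theorem 5.1 requires c > 1")
    return Q / M + math.sqrt(2 * c * s * Z * math.log(M))


def overhead_fraction(Q: float, M: int, s: int, Z: int, c: float = 1.5) -> float:
    """Extra space per bank relative to the ideal share Q/M."""
    ideal = Q / M
    return (min_bank_size(Q, M, s, Z, c) - ideal) / ideal
```

For example, calling `overhead_fraction(10**6, 100, 1, 1000)` shows that the slack term is a small fraction of Q/M in that setting, in line with the remark that K need only be slightly larger than Q/M.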

5.2 Number of Memory Banks 

Even though the cumulative size of memories in an SMS 
architecture can be close to that of an output-queued router, 
having a large number of small memories is slightly more 
expensive than having a small number of large memories. 

We have shown that $(1 + \lceil 1/s \rceil + \epsilon)N$ memories are
sufficient for an SMS router with speedup s to emulate an
output-queued router. It is natural to investigate how many
memories are actually necessary. First we examine what an
off-line algorithm can achieve.



LEMMA 5.3. If an algorithm has knowledge of the complete
arrival sequence then N memories are sufficient to
store the packets while satisfying arrival and departure
constraints.

Proof. Construct a bipartite multi-graph G(V, W, E) in
which the set of vertices V represents the arrival times of packets,
the set of vertices W represents the departure times of
packets, and one edge $(v, w) \in E$ is present for every packet
that arrives at time v and departs at time w. Since at most
N packets arrive in any cycle and at most N packets depart
in every cycle, the maximum degree of any vertex in G
is N. Thus by Birkhoff's theorem [28, Page 40] it can be
edge-colored using N colors and packets belonging to every
color-class can be stored in one memory. □
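
The proof is existential, but the coloring can also be computed. The sketch below uses the standard alternating-path method for edge-coloring bipartite multigraphs; it is one possible way to realise the lemma, not necessarily the construction behind the cited theorem. The function name `assign_memories` and the packet encoding are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def assign_memories(packets: List[Tuple[int, int]], n: int) -> List[int]:
    """Give each (arrival, departure) packet a memory in range(n).

    No two packets with the same arrival cycle, or the same departure
    cycle, share a memory. Assumes at most n packets per arrival cycle
    and at most n per departure cycle.
    """
    left: Dict[Tuple[int, int], int] = {}   # (arrival, memory) -> packet
    right: Dict[Tuple[int, int], int] = {}  # (departure, memory) -> packet
    mem = [-1] * len(packets)

    def free(table: Dict[Tuple[int, int], int], vertex: int) -> int:
        return next(c for c in range(n) if (vertex, c) not in table)

    for k, (u, w) in enumerate(packets):
        a, b = free(left, u), free(right, w)
        if (w, a) in right:
            # Memory a is busy at departure w: collect the a/b alternating
            # path that starts at w, then swap a and b along it.
            path, vertex, colour, on_right = [], w, a, True
            while True:
                e = (right if on_right else left).get((vertex, colour))
                if e is None:
                    break
                path.append(e)
                vertex = packets[e][0] if on_right else packets[e][1]
                on_right = not on_right
                colour = b if colour == a else a
            for e in path:
                del left[(packets[e][0], mem[e])]
                del right[(packets[e][1], mem[e])]
            for e in path:
                mem[e] = b if mem[e] == a else a
                left[(packets[e][0], mem[e])] = e
                right[(packets[e][1], mem[e])] = e
        # Memory a is now free at both endpoints of the new packet.
        mem[k] = a
        left[(u, a)] = k
        right[(w, a)] = k
    return mem
```

Because the graph is bipartite, the alternating path that starts at the departure vertex never reaches the new packet's arrival vertex, so memory a stays free there after the swap.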

The requirement on N memories is also trivially a lower 
bound since there are potentially N new packets in a cycle. 

Of course, in the context of a router, the algorithm has 
to operate on-line. Now we look at the absolute minimum 
number of memory banks that is required if an adversary is 
allowed to place packets in the memories. 

LEMMA 5.4. If an adversary places packets in the memory
then it is necessary to have N + (N - 1) memories
in order to satisfy arrival and departure constraints.

Proof. Consider the case where at every cycle $T < N^2 - 1$,
exactly 2 packets arrive for output $(T \bmod (N-1)) + 1$, one
packet arrives for every output o such that $o \ne (T \bmod (N-1)) + 1$
and $o < N$, and no packet arrives for output N.
At cycle $N^2 - 1$, the total number of arrivals at each output
from 1 to N - 1 would be $N^2 + N$ but the total number of
packets that departed through each output would be $N^2 - 1$.
Thus there would be N + 1 packets in the memory for each
output from 1 to N - 1. Hence for each of the next N + 1
cycles we will have N - 1 packets scheduled to depart. An
adversary could choose a set B of N - 1 memories and place
all of these packets into the memories in B such that each
memory stores one packet of each time-stamp between $N^2$
and $N^2 + N$. Now if N packets arrive all destined for output
N, then each packet will have a departure conflict with each
memory in B. Thus all of these new packets must be stored
in some memory that is not in B and no 2 packets can be
stored in the same memory. Therefore there must be N additional
memories. Hence we need 2N - 1 memories to store the
packets. □
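
The counting in this proof is easy to check mechanically. The sketch below replays the arrival pattern for cycles 0 through $N^2 - 2$ and reports the backlog for outputs 1 to N - 1; the function name is hypothetical, and it assumes (as the proof does) that each of these outputs sends exactly one packet per cycle during the prologue.

```python
from typing import List


def backlog_after_prologue(N: int) -> List[int]:
    """Backlog per output 1..N-1 after the N^2 - 1 prologue cycles."""
    if N < 2:
        raise ValueError("need at least two outputs")
    cycles = N * N - 1
    arrivals = [0] * (N + 1)  # indexed by output number; index 0 unused
    for T in range(cycles):
        hot = (T % (N - 1)) + 1  # the output that receives 2 packets
        for o in range(1, N):    # output N receives nothing
            arrivals[o] += 2 if o == hot else 1
    departures = cycles          # one departure per output per cycle
    return [arrivals[o] - departures for o in range(1, N)]
```

Every entry of the returned list equals N + 1, which matches the N + 1 packets per output used in the argument above.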

Since our algorithm controls the placement of packets in
the memory, it is possible that such an algorithm can make
do with a smaller number of memory banks than the bound
in Lemma 5.4. We now show that it is impossible for an SMS
router with fewer than 9N/8 memories to behave identically
to an output-queued router, regardless of how sophisticated
its scheduling algorithm is. In particular this means that
we cannot achieve the off-line optimal behavior in the on-line
case with only N memory banks, or even with $N + \delta N$
memories, if $\delta < 1/8$.

THEOREM 5.5. There is no deterministic algorithm that
can match any sequence of packet arrivals to memories while
satisfying arrival and departure constraints if the number of
memories is $M = N + \Delta$ and $\Delta < N/8$. Furthermore, for
any randomized algorithm there exists an arrival sequence
for which it will fail with probability at least 1/2.



In order to prove the theorem we will use a set of lemmas
that show that if we have a sequence of subsets of size close
to half of the original set such that any two consecutive sets
are disjoint, then any pair of sets with even sequence numbers
has a significant intersection.

LEMMA 5.6. If $X, Y, Z \subseteq [N + \Delta]$ such that $|X| = |Y| =
|Z| = N/2$ and $X \cap Y = Y \cap Z = \emptyset$ then $|X \cap Z| \ge N/2 - \Delta$.

Proof. Since $X \cap Y = Y \cap Z = \emptyset$, both X and Z are
subsets of $Y^c$. Since all sets are subsets of $[N + \Delta]$ and
$|Y| = N/2$, we have $|Y^c| = N/2 + \Delta$. Hence the minimum
size of $X \cap Z$ is $N/2 - \Delta$, i.e., $|X \cap Z| \ge N/2 - \Delta$. □

LEMMA 5.7. For any three sets X, Y, Z of size N/2, if
$|X \cap Y| \ge N/2 - \alpha$ and $|Y \cap Z| \ge N/2 - \beta$ then
$|X \cap Z| \ge N/2 - \alpha - \beta$.

Proof. The result follows from the observation that
$|X \cap Y \cap Z| \ge N/2 - \beta - \alpha$. □

LEMMA 5.8. For any series of sets $S_0, S_1, \ldots, S_{2m} \subseteq [N + \Delta]$,
if $|S_i| = N/2$ and $S_i \cap S_{i+1} = \emptyset$ then $|S_0 \cap S_{2m}| \ge
N/2 - m\Delta$.

Proof. The base case when m = 1 follows from Lemma 5.6.
Let the lemma be true for some m = p. Thus $|S_0 \cap S_{2p}| \ge
N/2 - p\Delta$ and $|S_{2p} \cap S_{2p+2}| \ge N/2 - \Delta$. Therefore from
Lemma 5.7 we get $|S_0 \cap S_{2(p+1)}| \ge N/2 - (p+1)\Delta$. □
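
Lemma 5.8 can be sanity-checked empirically by generating random chains of sets in which consecutive sets are disjoint and measuring the overlap of the two end sets. This is a minimal sketch with hypothetical helper names; it only tests random instances and is no substitute for the proof.

```python
import random
from typing import List, Set


def random_chain(N: int, delta: int, m: int, rng: random.Random) -> List[Set[int]]:
    """Build S_0..S_2m, subsets of range(N + delta) of size N/2 each,
    with every pair of consecutive sets disjoint. N must be even."""
    universe = set(range(N + delta))
    half = N // 2
    chain = [set(rng.sample(sorted(universe), half))]
    for _ in range(2 * m):
        # The complement of the previous set has N/2 + delta elements.
        candidates = sorted(universe - chain[-1])
        chain.append(set(rng.sample(candidates, half)))
    return chain


def check_lemma_5_8(N: int, delta: int, m: int, trials: int = 200, seed: int = 0) -> bool:
    """Return True if every sampled chain satisfies the lemma's bound."""
    rng = random.Random(seed)
    for _ in range(trials):
        chain = random_chain(N, delta, m, rng)
        if len(chain[0] & chain[-1]) < N / 2 - m * delta:
            return False
    return True
```

With $\Delta = 0$ the bound forces $S_0 = S_{2m}$, which is a convenient special case to test.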

We can now prove Theorem 5.5. We will do so by defining
two packet arrival sequences such that, based on the choices
made by any algorithm, an adversary can always choose one
of the arrival sequences that causes the algorithm to fail if $\Delta < N/8$.

Assume the number of outputs is even. Let $O_1$ be a set
of N/2 outputs and $O_2$ be the remaining set of outputs. Define
$a_i$ ($b_i$) to be the set of packets that depart at time i and are
destined for an output in $O_1$ ($O_2$). Our arrival process is
such that $|a_i| = N/2$ or 0 and all the packets in any set $a_i$
arrive in the same cycle. Similarly $|b_i| = N/2$ or 0 and all
the packets in any set $b_i$ arrive in the same cycle.

Now we will present two arrival sequences. The two arrival
sequences are described in Table 1. Both sequences
have a common prologue till time 9 as described in the first
column of the table. The second and third columns describe
the packets that arrive in sequence 1 and sequence 2 respectively
after the prologue. A dash in the input column indicates
that no packets arrived at those N/2 inputs. It is easy to
verify that the time-stamp assignments are consistent with
output queuing. We will use $A_i^*$ ($B_i^*$) to represent the set
of memories that the packets of $a_i$ ($b_i$) will be stored in,
where the superscript * is either p, 1 or 2 based on whether
the set of packets corresponds to the prologue, sequence 1, or
sequence 2 respectively. Since all the packets departing together
must be stored in different memories, if $a_i \ne \emptyset$ then
$|A_i^*| = |a_i| = N/2$. Similarly if $b_i \ne \emptyset$ then $|B_i^*| = N/2$.

For notational convenience, we introduce the infix binary 
relational operator denoting set disjointness, i.e., U V 
iff U fl V = 0. From arrival time constraints we get A[ x 5A 
4? 2 > #12 *h #13 , B 2 2 <h #13 > al" 1 ^14 <fr #11, and from 
departure time constraints we get A\ x B?i , A* 2 B\ 2 , 
A** * B{ 3 , A** + and A 2 U *h B? 4 . 

Now since there are a total of $N + \Delta$ memories and $A^p_{11}$
is connected to $B^2_{11}$ through a chain of 8 disjointness relations, from
Lemma 5.8 we get $|B^2_{11} \cap A^p_{11}| \ge N/2 - 4\Delta$. But we know
that $B^2_{11} \cap A^p_{11} = \emptyset$. Thus $N/2 - 4\Delta \le 0$, or $\Delta \ge N/8$.
Therefore we conclude that if $\Delta < N/8$ any deterministic
algorithm will fail. Furthermore, if any randomized algo- 
rithm, chooses A\ x and such that it works correctly for 
sequence 1 with probability $\theta$ then it must fail for sequence 2
with probability $\theta$. Thus the worst case probability of failure
for any randomized algorithm is at least $\max(\theta, 1 - \theta) \ge 0.5$.

Table 1: Adversarial arrival sequence.



APPENDIX 

A. DETAILED ANALYSIS FOR SECTION 4.2

We now give a detailed analysis of the pipelined random- 
ized scheduler based on the pipelined matching procedure 
in section 4.2, for the case when a is a positive integer, and 
$\epsilon > 0$ is an arbitrarily small constant.

Define $z_i = N/(2 \uparrow\uparrow i)$. Let D be the smallest integer
such that $z_D \le (W + 1)\sqrt{2M \log M}$. Clearly $D =
O(\log^* N)$. Let $Q_i(T,t)$ be the set of input ports that have
i unmatched packets at the start of the t-th iteration of the
pipelined matching procedure in cycle T, and let $q_i(T,t) =
|Q_i(T,t)|$. Let $s_i(T,t) = \sum_{k=i}^{D} q_k(T,t)$. In the following we
will use Lemma 4.2, and we will assume that each execution
of the basic matching procedure is successful. (This will occur
with high probability.)

We define a series of predicates $A_0(T), \ldots, A_D(T)$. Predicate
$A_j(T)$ is defined to be true iff for all $i \le D$, $s_i(T, 0) \le
z_{i-j}$, where $z_i = N$ if $i < 0$. Note that this is a refinement
of the predicates $A_j$ defined in the extended abstract (as are
$s_i$, $q_i$ and $Q_i$).

THEOREM A.1. There exists a suitable constant $\omega$ such
that if each stage executes $\omega$ iterations of the pipelined matching
procedure then $A_0(T)$ implies $A_0(T + 1)$ w.h.p. in N.

In order to prove the above theorem we will first need the 
following lemma. 

LEMMA A.2. If $s_{i+1}(T,t) \le a$ and $s_i(T,t) \le a + b$ then
w.h.p. in N we must have $s_i(T, t+1) \le a + b e^{-N\epsilon/b}/\alpha$.

Proof. Let $q_i(T,t) = x$ and $s_{i+1}(T,t) = y$. If $x \le
W\sqrt{2M \log M}$ then at the end of that iteration w.h.p. in
N all the inputs in $Q_i(T,t)$ will get matched; otherwise we
will have at most $x e^{-N\epsilon/x}/\alpha$ inputs with i unmatched packets
that were also in $Q_i(T,t)$ (Lemma 4.2). Let $\delta$ be the
number of inputs that got matched in $Q_{i+1}(T,t)$; thus
$q_i(T, t+1) \le x e^{-N\epsilon/x}/\alpha + \delta$ and $s_{i+1}(T, t+1) \le y - \delta$.
Thus $s_i(T, t+1) = s_{i+1}(T, t+1) + q_i(T, t+1) \le y +
x e^{-N\epsilon/x}/\alpha$. Thus,

$$s_i(T, t+1) \le \max_{y \le a,\; x + y \le a + b} \left( y + \frac{x e^{-N\epsilon/x}}{\alpha} \right).$$

It is straightforward to show that the function on the R.H.S.
achieves its maximum at y = a and x = b. Substituting that
we get the desired result. □

Substituting $a = z_{i+1}$, $b = z_i - z_{i+1}$ and $t = 0$ in the
above lemma we get $s_i(T, 1) \le z_{i+1} + (z_i - z_{i+1}) e^{-N\epsilon/(z_i - z_{i+1})}/\alpha$.
Since $z_{i+1} \le z_i/2$ we get $s_i(T, 1) \le \beta z_i$, where
$\beta = 1/2 + 1/(2\alpha) < 1$.
Similarly $s_{i+1}(T, 1) \le \beta z_{i+1}$. Thus applying this argument
repeatedly we get $s_i(T, t) \le \beta^t z_i$. Let g be a constant
such that $\beta^g < \min(1/2, \epsilon/\ln 2)$. Thus $s_i(T, g) \le z_i \beta^g$ and
$s_{i+1}(T, g) \le z_{i+1} \beta^g$.

Substituting $a = z_{i+1}\beta^g$, $b = z_i \beta^g$ and $t = g$ in
Lemma A.2 for the next iteration it is not difficult to show
that

$$s_i(T, g+1) \le \beta^g \left( z_{i+1} + z_i e^{-N\epsilon/(z_i \beta^g)}/\alpha \right) \le z_{i+1}.$$

Thus if we set $\omega \ge g + 1$, we have $s_{i-1}(T, \omega) \le z_i$. Since at
most one packet arrives in a cycle, $s_i(T+1, 0) \le s_{i-1}(T, \omega) \le
z_i$. Hence $A_0(T + 1)$ holds with high probability in N.

LEMMA A.3. If $A_0(T)$ is true then, w.h.p. in N, all packets
that arrived in cycle T - D have been matched at the end of
cycle T.

Proof. From the definition of $A_0(T)$ we get $q_D(T, 0) =
s_D(T, 0) \le (W + 1)\sqrt{2M \log M}$. Thus w.h.p. in N all
the inputs in $Q_D(T, 0)$ get matched in the first iteration
of the pipelined matching procedure. Thus $q_D(T, 1) = 0$, i.e., no
input has D unmatched packets. Thus all the packets that
arrived in cycle T - D are matched. □

Since $A_0(0)$ is trivially true, by Theorem A.1 we can argue
inductively that $A_0(T)$ is true for $T = O(N)$. However as T
grows large, the probability that $A_0(T)$ will continue to be
true becomes small and then we can no longer guarantee that
all the packets that arrived in cycle T - D will be matched
at the end of cycle T. However if we set $\omega \ge 2(g + 1)$ our
algorithm becomes "self-stabilizing", i.e., if $A_0(T)$ becomes
false for some T, then within D cycles the input queues get
back to a state where the predicate $A_0$ is true.

Note that $A_j(T)$ implies $A_k(T)$ if $k > j$.

Theorem A.4. If j > 0 and $A_j(T)$ is true then, w.h.p.,
$A_{j-1}(T + 1)$ is true.

Proof. Recall that in the proof of Theorem A.1 we proved
that if $s_i(T, 0) \le z_i$ then $s_i(T, g+1) \le z_{i+1}$. Using a similar
argument, if $s_i(T, 0) \le z_{i-j}$ then $s_i(T, g+1) \le z_{i-j+1}$. If
we apply another g + 1 iterations we get $s_i(T, 2(g+1)) \le
z_{i-j+2}$. Thus setting $\omega = 2(g + 1)$, we get $s_i(T+1, 0) \le
s_{i-1}(T, 2(g+1)) \le z_{i-j+1}$. Hence $A_{j-1}(T + 1)$ holds. □

Since $A_D(T)$ is trivially true, in D steps we get back
to a state where $A_0(T)$ is true. This establishes the
self-stabilizing feature of our pipelined algorithm.